(Lin et al. 2018) (Durinck et al. 2009) (Durinck et al. 2005) (Davis and Meltzer 2007) (Huber et al. 2015) (R Core Team 2022) (Sanghi, Gruber, and Metwally 2021) (Law et al. 2016) (Robinson, DJ, and Smyth 2010) (McCarthy, Y, and Smyth 2012) (Chen, Lun, and Smyth 2016) (McCarthy, Y, and Smyth 2012) (Morgan 2021) (Kolberg et al. 2020) (Gu, Eils, and Schlesner 2016)
The data set I selected from GEO (Gene Expression Omnibus) was “APOE4 Causes Widespread Molecular and Cellular Alterations Associated with Alzheimer’s Disease Phenotypes in Human iPSC-Derived Brain Cell Types” conducted by the lab in MIT on Illumina HiSeq 2000 palteform with GEO identifier GSE102956. This research compares the three controls(APOE3) and three tests(APOE4) iPSC-Derived Brain Cell to find whether APOE4 variant of APOE gene will cause Alzheimer’s Disease. In the previous assignment, the data set was first cleaned by checking duplication and removing the low gene counts and outliers. Then the mixed NCBI gene ids and HGNC ids are merged and mapped into HGNC id. Finally, the data was normalized by TMM to be better used in the future.
Packages prep
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!requireNamespace("GEOmetadb", quietly = TRUE))
BiocManager::install("GEOmetadb")
if (!requireNamespace("GEOquery", quietly = TRUE))
BiocManager::install("GEOquery", force = TRUE)
if (!requireNamespace("knitr", quietly = TRUE))
BiocManager::install("knitr")
if (!requireNamespace("edgeR", quietly = TRUE))
BiocManager::install("edgeR")
if (!requireNamespace("biomaRt", quietly = TRUE))
BiocManager::install("biomaRt")
if (!requireNamespace("magrittr", quietly = TRUE))
BiocManager::install("magrittr")
if (!requireNamespace("kableExtra", quietly = TRUE))
BiocManager::install("kableExtra")
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
BiocManager::install("ComplexHeatmap")
if (!requireNamespace("circlize", quietly = TRUE))
BiocManager::install("circlize")
if (!requireNamespace("limma", quietly = TRUE))
BiocManager::install("limma")
if (!requireNamespace("gprofiler2", quietly = TRUE))
BiocManager::install("gprofiler2")
library("BiocManager")
library("GEOmetadb")
library("GEOquery")
library("knitr")
library("edgeR")
library("biomaRt")
library("magrittr")
library("kableExtra")
library("ComplexHeatmap")
library("circlize")
library("limma")
library("gprofiler2")
Load normalized data from A1.
normalized_count_data <- read.table(file=file.path(getwd(), "data", "nomolized_gene"), header = TRUE,sep = ",",stringsAsFactors = FALSE,check.names=FALSE)
colnames(normalized_count_data)[1] <- "GeneID"
Create a heat map to visualize our data.
heatmap_matrix <- normalized_count_data[, 2:ncol(normalized_count_data)]
rownames(heatmap_matrix) <- normalized_count_data$GeneID
head(heatmap_matrix)
## NEU_E3.1 NEU_E3.2 NEU_E3.3 NEU_E4.1 NEU_E4.2 NEU_E4.3
## 729737 1.229206 1.142812 0.8316146 1.502754 0.3241103 0.7077164
## LINC01128 41.301326 55.426388 57.9754153 66.443200 65.9024298 61.8072295
## LINC02593 6.146031 3.428436 2.3760416 3.756885 3.1330663 3.5385818
## SAMD11 93.050905 28.798866 26.9680723 34.241326 24.4163100 24.4162147
## NOC2L 82.725573 92.224939 93.1408311 80.719364 81.4597247 81.7412406
## KLHL17 13.890029 13.485183 11.4049997 10.411939 9.5072358 8.2566910
#Scale each row
heatmap_matrix <- t(scale(t(heatmap_matrix)))
if(min(heatmap_matrix) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix), 0,
max(heatmap_matrix)), c("blue", "white", "red"))
}
current_heatmap <- Heatmap(name = "Gene Count Heatmap",
as.matrix(heatmap_matrix),
show_row_dend = TRUE,show_column_dend = TRUE,
col=heatmap_col,show_column_names = TRUE,
show_row_names = FALSE,show_heatmap_legend = TRUE)
current_heatmap
Use Limma package to see our data clustering colored by cell type since cell type is the only control in my data set.
limma::plotMDS(main = "Multidimensional Scaling Plot",
heatmap_matrix,
col = c(rep("darkgreen",3), rep("blue",3)))
creates a design matrix
samples <- data.frame(cell_type = c(rep("NEU_E3", 3), rep("NEU_E4", 3)))
rownames(samples) <- colnames(normalized_count_data)[2:7]
model_design <- model.matrix(~ samples$cell_type )
kable(model_design[1:6,], type="html")
| (Intercept) | samples$cell_typeNEU_E4 |
|---|---|
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
Create our data matrix
expressionMatrix <- as.matrix(normalized_count_data[,2:7])
rownames(expressionMatrix) <- normalized_count_data$GeneID
colnames(expressionMatrix) <- colnames(normalized_count_data)[2:7]
minimalSet <- ExpressionSet(assayData=expressionMatrix)
#Fit our data to the above model
fit <- lmFit(minimalSet, model_design)
Apply empirical Bayes to compute differential expression for the above described model. We use Benjamini & Hochberg model as multiple hypothesis correction method, since it is the most commonly used and works very well in our dataset.
fit2 <- eBayes(fit, trend = TRUE)
topfit <- topTable(fit2,
coef=ncol(model_design),
adjust.method = "BH",
number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits <- merge(normalized_count_data[,1:2],
topfit,
by.y=0,by.x=1,
all.y=TRUE)
output_hits <- output_hits[,-2]
#sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]
head(output_hits)
## GeneID logFC AveExpr t P.Value adj.P.Val
## 25 100287846 13.803584 6.901792 15.459182 1.163843e-06 0.00849946
## 9680 RRAGB 82.255531 42.653304 15.265943 1.267914e-06 0.00849946
## 3623 ERC1 201.661568 313.959011 12.529843 4.825670e-06 0.02156592
## 7932 PCDHA6 9.085897 6.765306 10.808858 1.295026e-05 0.04340602
## 10220 SLC16A2 139.686820 212.644979 9.994541 2.173238e-05 0.05146164
## 7222 NDUFA10 -56.200447 90.237582 -9.702875 2.640008e-05 0.05146164
## B
## 25 -2.594330
## 9680 -2.596579
## 3623 -2.639545
## 7932 -2.683062
## 10220 -2.710972
## 7222 -2.722503
Number of gene pass the threshold p-value < 0.05. We use p-value less that 0.05 by convention. Also the number of genes that are better than this p-value is resonable.
length(which(output_hits$P.Value < 0.05))
## [1] 680
Number of gene pass correction?
length(which(output_hits$adj.P.Val < 0.05))
## [1] 4
Plot the result and gene APOE
simple_model_pvalues <- data.frame(GeneID = output_hits$GeneID, simple_pvalue=output_hits$P.Value)
simple_model_pvalues$colour <- "gray"
simple_model_pvalues$colour[simple_model_pvalues$simple_pvalue < 0.05] <- "orange"
simple_model_pvalues$colour[simple_model_pvalues$GeneID == "APOE"] <- "red"
plot(simple_model_pvalues$simple_pvalue,
col = simple_model_pvalues$colour,
xlab = "simple model p-values",
main="Simple Limma MA Plot")
points(which(simple_model_pvalues$GeneID == "APOE"), simple_model_pvalues[which(simple_model_pvalues$GeneID == "APOE"),2], pch=20,
col="red", cex=1.5)
legend(0,1,legend=c("Significant","Insif", "APOE"),
fill=c("orange","grey", "red"),cex = 0.7)
#This plot is only one dimensional because there is only one group of control versus one group of test. Thus we only have a simple model with one tyoe of grouping.
#The APOE's position on the graph makes sense because APOE genes should not be expressed very differently. Instead APOE gene variant will induce other genes express differently, which is the purpose of our study.
Plot the heat map for only significant p-value genes.
top_hits <- output_hits$GeneID[output_hits$P.Value<0.05]
heatmap_matrix_tophits <- t(scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),])))
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0,
max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
current_heatmap <- Heatmap(name = "Differentialized P-Value Heapmap",
as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE,
cluster_columns = TRUE,
show_row_dend = TRUE,
show_column_dend = TRUE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
)
current_heatmap
#The result looks very promising with controls and test grouped together because controls have the same APOE3 gene and test have the same APOE4 genes.
Limma package has already given a reasonable and informative result. I will not further dilapidated use EdgeR package since Limma and EdgeR are similar in the core and written by the same authors.
How many genes are up regulated.
length(which(output_hits$P.Value < 0.05
& output_hits$logFC > 0))
## [1] 499
How many genes are down regulated?
length(which(output_hits$P.Value < 0.05
& output_hits$logFC < 0))
## [1] 181
Create thresholed lists of genes
output_hit_with_rank <- output_hits
output_hit_with_rank[,"rank"] <- -log(output_hit_with_rank$P.Value, base = 10) * sign(output_hit_with_rank$logFC)
output_hit_with_rank <- output_hit_with_rank[order(output_hit_with_rank$rank), ]
upregulated_genes <- output_hits$GeneID[
which(output_hits$P.Value < 0.05
& output_hits$logFC > 0)]
downregulated_genes <- output_hits$GeneID[
which(output_hits$P.Value < 0.05
& output_hits$logFC < 0)]
write.table(x=upregulated_genes,
file=file.path("data","upregulated_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=downregulated_genes,
file=file.path("data","downregulated_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=data.frame(genename= output_hit_with_rank$GeneID,F_stat= output_hit_with_rank$rank),
file=file.path("data","ranked_genelist.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
Use G:Profiler to perform gene enrichment analysis becaise it has both web interface and R package that are very handy. GO:BP (2022-12-04), Reactome(2022-12-28), and WikiPathways(2022-12-10) for annotation because they are comprehensive and cover most human biological pathways.
Upregulated genes
GEA_upragulated <- gprofiler2::gost(query = upregulated_genes,
organism = "hsapiens",
exclude_iea = TRUE,
correction_method = "fdr",
sources = c("GO:BP", "REAC", "WP"))
#we only want the term size <200
GEA_upragulated_top <- data.frame(
term_name = GEA_upragulated$result$term_name[GEA_upragulated$result$term_size < 200 &
GEA_upragulated$result$term_size > 1],
term_id = GEA_upragulated$result$term_id[GEA_upragulated$result$term_size < 200 &
GEA_upragulated$result$term_size > 1],
source = GEA_upragulated$result$source[GEA_upragulated$result$term_size < 200 &
GEA_upragulated$result$term_size > 1]
)
knitr::kable(head(GEA_upragulated_top, 10), format = "html")
| term_name | term_id | source |
|---|---|---|
| neuron migration | GO:0001764 | GO:BP |
| axon guidance | GO:0007411 | GO:BP |
| neuron projection guidance | GO:0097485 | GO:BP |
| synapse assembly | GO:0007416 | GO:BP |
| action potential | GO:0001508 | GO:BP |
| regulation of microtubule depolymerization | GO:0031114 | GO:BP |
| central nervous system neuron differentiation | GO:0021953 | GO:BP |
| regulation of microtubule cytoskeleton organization | GO:0070507 | GO:BP |
| regulation of microtubule polymerization or depolymerization | GO:0031110 | GO:BP |
| membrane depolarization during action potential | GO:0086010 | GO:BP |
length(GEA_upragulated_top$term_name)
## [1] 82
# 82 genesets are returned for upregulated genes with 1 < term size < 200, and p value < 0.05
Plot visualization
gprofiler2::gostplot(GEA_upragulated) %>% plotly::layout(title = "Upregulated genes plot", font = list(size = 10))
Downregulated genes
GEA_downragulated <- gprofiler2::gost(query = downregulated_genes,
organism = "hsapiens",
exclude_iea = TRUE,
correction_method = "fdr",
sources = c("GO:BP", "REAC", "WP"))
#we only want the term size <200
GEA_downragulated_top <- data.frame(
term_name = GEA_downragulated$result$term_name[GEA_downragulated$result$term_size < 200 &
GEA_downragulated$result$term_size > 1],
term_id = GEA_downragulated$result$term_id[GEA_downragulated$result$term_size < 200 &
GEA_downragulated$result$term_size > 1],
source = GEA_downragulated$result$source[GEA_downragulated$result$term_size < 200 &
GEA_downragulated$result$term_size > 1]
)
knitr::kable(head(GEA_downragulated_top, 10), format = "html")
| term_name | term_id | source |
|---|---|---|
| Response of EIF2AK1 (HRI) to heme deficiency | REAC:R-HSA-9648895 | REAC |
| 2q37 copy number variation syndrome | WP:WP5224 | WP |
| Photodynamic therapy-induced unfolded protein response | WP:WP3613 | WP |
length(GEA_downragulated_top$term_name)
## [1] 3
# 3 genesets are returned for downregulated genes with 1 < term size < 200, and p value < 0.05
Plot visualization
gprofiler2::gostplot(GEA_downragulated) %>% plotly::layout(title = "Downregulated genes plot", font = list(size = 10))
All genes
all_genes <- gprofiler2::gost(query = output_hits$GeneID,
organism = "hsapiens",
exclude_iea = TRUE,
correction_method = "fdr",
sources = c("GO:BP", "REAC", "WP"))
#we only want the term size <200
all_genes_top <- data.frame(
term_name = all_genes$result$term_name[all_genes$result$term_size < 200 &
all_genes$result$term_size > 1],
term_id = all_genes$result$term_id[all_genes$result$term_size < 200 &
all_genes$result$term_size > 1],
source = all_genes$result$source[all_genes$result$term_size < 200 &
all_genes$result$term_size > 1]
)
knitr::kable(head(all_genes_top, 10), format = "html")
| term_name | term_id | source |
|---|---|---|
| rRNA processing | GO:0006364 | GO:BP |
| ribonucleoprotein complex assembly | GO:0022618 | GO:BP |
| regulation of mRNA processing | GO:0050684 | GO:BP |
| mitochondrial gene expression | GO:0140053 | GO:BP |
| cytoplasmic translation | GO:0002181 | GO:BP |
| regulation of RNA splicing | GO:0043484 | GO:BP |
| neuron projection extension | GO:1990138 | GO:BP |
| regulation of macroautophagy | GO:0016241 | GO:BP |
| vacuolar transport | GO:0007034 | GO:BP |
| mitochondrial translation | GO:0032543 | GO:BP |
length(all_genes_top$term_name)
## [1] 1423
# 1423 genesets are returned for all genes with 1 < term size < 200, and p value < 0.05
Plot visualization
gprofiler2::gostplot(all_genes) %>% plotly::layout(title = "All Genes genes plot", font = list(size = 10))
From the above three over-representations analysis, we are able to see that there are around 30 times more upregulated gene sets than down regulated gene sets, which suggests that the upregulated genes are more actively involved in biological pathways than downregulated genes. The overall gene data sets suepasses both up regulated and down regulated gene sets significantly suggests that the genes other than up regulated and down regulated also plays a crutial part in the overall pathways.
The over-representation results amazingly support the mechanism discussed in the original paper. In the over-representation analysis above, we can see that up regulated genes are heavily involved in neural pathways, in which the top ten results are all related to neural transmission and neuron development. In the paper, the author concluded that the different variants of APEO gene will have an overall effete of neuron gene expression level, which thus cause Alzheimer’s disease.
In the paper, the result showed that “APOE4 neurons exhibited increased synapse number and elevated Ab42 secretion relative to isogenic APOE3 cells” which perfectly supports our result. Because the upregulated genes are the genes that are significantly more in APOE4 neurons than APOE3 neurons. These upregulated genes are heavily involve in the synapse development pathways which is exactly what the papers illustrated.